Explainer notebook - The Foods-Network¶
Contribution statement:¶
Team members:
- Jacob (s214596)
- Kristoffer (s214609)
- Karoline (s214638)
All members collaborated and contributed to every part of the assignment.
Relevant Links:¶
A project website has been deployed to present the findings of the study.
Specifics can be found in the following Github repository.
Below are all the relevant packages used to complete this project:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import altair as alt
import vl_convert as vlc
import ast
import json
import os  # Used to store the API key as an environment variable
import random
import time
import requests
from itertools import product, chain
from collections import Counter
from tqdm import tqdm
import networkx as nx
from networkx.readwrite import json_graph
from netwulf import visualize, draw_netwulf
import community
from joblib import Parallel, delayed
from openai import OpenAI
from wordcloud import WordCloud
from nltk.tokenize import word_tokenize, sent_tokenize
from selenium import webdriver
from bs4 import BeautifulSoup
Table of Contents¶
1. Motivation ¶
1.1 Motivation and selection of data for the study ¶
This project aims to uncover valuable insights into consumer behavior in Denmark by investigating a network of grocery items frequently purchased together in some of the country’s largest and most well-known grocery stores, owned by the Salling Group. Understanding consumer behavior is essential, as it constitutes a broad and important field that integrates psychology, sociology, economics, and marketing [1]. In this project we investigate this network using a social science approach in order to discover how consumer choices are reflected in which products get bought together.
The data used in the project includes an id, name, price and a short description of common everyday food products, collected by scraping the BilkaToGo website. The project utilizes the Salling Group API to gather information about items that are frequently bought together by calling the endpoint for each product id. This particular data was chosen in order to create a network of food items where an edge between two products occurs if they appear on the same 'frequently bought together' list. The project conducts a thorough analysis of the network in order to gain insights into its structure and investigate which types of food items often land in the same basket, and what kinds of communities emerge based on what the average consumer decides to buy together when shopping.
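This edge rule can be sketched as follows; the product IDs and neighbour lists below are illustrative, not taken from the scraped data:

```python
import networkx as nx

# Hypothetical 'frequently bought together' results: product_id -> neighbour ids
neighbours = {
    18381: [51061, 61090],
    51061: [18381],
    61090: [18381, 72008],
}

G = nx.Graph()
for product, neighbour_ids in neighbours.items():
    for n in neighbour_ids:
        # Undirected edge: the pair co-occurs on a suggestion list.
        # Duplicate pairs (e.g. 18381-51061 in both directions) collapse into one edge.
        G.add_edge(product, n)

print(G.number_of_nodes(), G.number_of_edges())  # → 4 3
```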
The descriptions collected for each product are small text pieces of, on average, around 60 words, typically explaining what the product tastes like and in what occasions or settings it is typically enjoyed. This data is used to conduct a textual analysis using NLP tools to gain insights into the communities and label them according to the information they hold.
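As a minimal illustration of this kind of textual analysis (the Danish description and the tiny stopword list below are invented, and the split-based tokenisation is a simplified stand-in for the nltk tokenizers imported above):

```python
from collections import Counter

# Invented example description (not from the scraped data)
description = "En soed og cremet yoghurt der nydes til morgenmad eller som dessert."

# Lowercase, strip punctuation and split on whitespace (simplified tokenisation)
tokens = [w.strip(".,").lower() for w in description.split()]

# Drop a tiny hand-made Danish stopword list before counting word frequencies
stopwords = {"en", "og", "der", "til", "eller", "som"}
counts = Counter(t for t in tokens if t not in stopwords)

print(counts.most_common(3))
```

Aggregating such counts per community gives the characteristic terms used to label it.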
1.2 The goal for the end user’s experience ¶
The Boston Consulting Group mentions in a Danish consumer sentiment series that in 2023 "health was the most important purchasing criterion, with 42% of respondents considering health important when buying food." [2]. Additionally, they mention that the sustainability trend is prominent in the groceries category, with 21% of consumers finding sustainability important when shopping for groceries. Through the analysis, this project aims to investigate whether some of these patterns are recognisable in the network and to shed some light on what types of products generally tend to get purchased together. A further goal is that the analysis will reflect consumer behaviors recognizable from our daily lives, but also uncover patterns we might not initially have thought of.
2.1 Data collection and Webscraping ¶
To analyze consumer patterns, we need data on the different items sold by the Salling Group. We utilize the Salling Group API, specifically the frequently-bought-together endpoint. This requires product IDs, which are web scraped from the BilkaToGo website where they originate. Because we need three types of data for the products, we have divided our web scraping into three parts:
- Webscraping for productID's
- Webscraping for individual product descriptions
- Frequently bought together endpoint call
The BilkaToGo website is structured so that all products fall into one of 21 categories, each with a dedicated page. Initially, we looped through each category link to scrape all products on these pages. However, the site's heavy reliance on JavaScript, tied to a backend database, made web scraping a more tedious and time-consuming process. To overcome this, we used the Python package Selenium with a webdriver to interact with the pages. This required opening each page and using CSS selectors to scrape the data.
This first step took a long time to figure out and to run in order to get all product IDs from the website. We uncovered ~32,000 products, with ~60 products per page, meaning we initialized the webdriver for a product page around 533 times for this first part. Note that the website is tied to a backend database, and grocery prices and products change a lot over time, so the dataset is specific to the date and time we scraped it (17 April 2024).
During this first pass of data collection, we also gathered the name, price, link and category of each product.
The second part of the data collection was to gather the textual data for each product, in the form of a product description tied to each individual product.
1. Webscraping for productID's¶
BASE_LINK = "https://www.bilkatogo.dk"

def extract_products(url, category_list):
    products_df = pd.DataFrame(columns=['product_id', 'name', 'price', 'link', 'category'])
    for category in tqdm(category_list):
        page_counter = 0
        SCRAPE_LINK = f"{url}/{category}"  # URL for the first page (no page number)
        product_card_containers = ["dummy"]  # Non-empty placeholder so the while loop starts
        while len(product_card_containers) != 0:
            print("The length of product_card_containers is:", len(product_card_containers))
            print("The page counter is:", page_counter)
            # Initialize Edge WebDriver in headless mode
            options = webdriver.EdgeOptions()
            options.use_chromium = True
            options.add_argument('headless')
            driver = webdriver.Edge(options=options)
            # Load the webpage and wait briefly to ensure JavaScript execution
            driver.get(SCRAPE_LINK)
            time.sleep(0.5)
            # Get the page source after JavaScript execution, then close the WebDriver
            page_source = driver.page_source
            driver.quit()
            # Parse the HTML content and find all product-card-container elements
            soup = BeautifulSoup(page_source, 'html.parser')
            product_card_containers = soup.find_all('div', {'class': 'product-card-container'})
            for product in product_card_containers:
                product_info = product.find('div')
                product_id = product_info.attrs['data-productid']
                product_name = product_info.contents[0]['aria-label']
                product_link = product_info.contents[0]['href']
                prod_price = product_info.find('div', {'class': 'row product-description flex-column'})
                product_price = prod_price.find('p', {'class': 'description'})
                # 'drikkevarer' doesn't share the price format of the other categories and follows a
                # slightly random pattern when displaying the price, so it needs its own logic
                if category == "drikkevarer/":
                    for span in product_price.find_all('span'):
                        if "/L." in span.text:
                            product_kg_price = span.string
                else:
                    # The kg./stk. price; sufficient for all categories except 'drikkevarer'
                    product_kg_price = product_price.find_all('span')[-1].string
                new_row = {'product_id': product_id, 'name': product_name, 'link': product_link,
                           'price': product_kg_price, 'category': category}
                products_df.loc[len(products_df)] = new_row
            page_counter += 1
            SCRAPE_LINK = f"{url}/{category}/?page={page_counter}"
    return products_df

LINK = "https://www.bilkatogo.dk/kategori/"
categories = ["frugt-og-groent/", "koed-og-fisk/", "mejeri-og-koel/", "drikkevarer/", "broed-og-kager/",
              "kolonial/", "slik-og-snacks/", "frost/", "kiosk/", "dyremad/", "husholdning/",
              "personlig-pleje/", "baby-og-boern/", "bolig-og-koekken/", "fritid-og-sport/",
              "toej-og-sko/", "elektronik/", "have/", "leg/", "biludstyr/", "byggemarked/"]
prods = extract_products(LINK, categories)
prods.to_csv('data/df_Salling_Products.csv', sep=';', index=False)
2. Webscraping for individual product descriptions¶
BASE_LINK = "https://www.bilkatogo.dk"
df1 = pd.DataFrame(columns=["p_id", "descriptions"])

def get_product_description(link):
    product_descriptions = []
    url = BASE_LINK + link
    # Initialize Edge WebDriver in headless mode
    options = webdriver.EdgeOptions()
    options.use_chromium = True
    options.add_argument('headless')
    driver = webdriver.Edge(options=options)
    # Load the webpage and wait a few seconds to ensure JavaScript execution
    driver.get(url)
    time.sleep(2)
    # Get the page source after JavaScript execution, then close the WebDriver
    page_source = driver.page_source
    driver.quit()
    # Parse the HTML content
    soup = BeautifulSoup(page_source, 'html.parser')
    try:
        description = soup.find('section', {'id': 'content-description', 'class': 'content'})
        a = description.find("h2")
        if a is not None:
            # Keep only the text preceding the first <h2> heading
            description = ''.join([str(tag) for tag in reversed(list(description.h2.previous_siblings)) if tag.name != 'h2'])
            description = description.replace("<br/><br/>", "")
        else:
            description = description.text
    except AttributeError:
        description = "No description available"
    product_descriptions.append(description)
    return product_descriptions

products = pd.read_csv('data/df_Salling_Products.csv', sep=";")
product_links = products["link"]
product_id = products["product_id"]

# Resume from however many descriptions are already saved, scraping 14 links in parallel per batch
for i in tqdm(range(len(pd.read_csv('data/df_Salling_Products_Descriptions_CLEANED.csv', sep=";")), len(product_links), 14)):
    results = Parallel(n_jobs=14)(delayed(get_product_description)(link) for link in product_links[i:i+14])
    df1['p_id'] = list(product_id[i:i+14])
    df1["descriptions"] = sum(results, [])
    df1.to_csv('data/df_Salling_Products_Descriptions_CLEANED.csv', sep=';', mode='a', index=False, header=False)
3. Frequently bought together endpoint call¶
df2 = pd.read_csv("data/df_Salling_Products_outer_categories.csv", sep=";")
df2 = list(df2.loc[df2["outer_category"] == "Foods"]["product_id"])

# The Salling Group API key is read from an environment variable rather than hardcoded
# (the variable name is our own choice)
bearer_token = os.environ["SALLING_API_KEY"]

class BearerAuth(requests.auth.AuthBase):
    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers["authorization"] = "Bearer " + self.token
        return r

df_neighbor = pd.DataFrame(columns=["product_id", "neighbor_products_id"])

def get_neighbours(product_id):
    neighbour_id = []
    url = "https://api.sallinggroup.com"
    version = "v1-beta"
    resources = "product-suggestions/frequently-bought-together"
    PARAMETERS = {"productId": f"{product_id}"}
    API_LINK = f"{url}/{version}/{resources}"
    results = requests.get(API_LINK, params=PARAMETERS, auth=BearerAuth(bearer_token)).json()
    # On success the endpoint returns a list of suggestions; on error, a dict with a status code
    if isinstance(results, dict):
        if results.get('statusCode') == 429:
            print("Sleeping the code... Error 429")
            time.sleep(20)
        return neighbour_id
    for neighbours in results:
        try:
            neighbour_id.append(int(neighbours['prod_id']))
        except (KeyError, ValueError):
            print("Error")
    return neighbour_id

time_step = 4
# Resume from however many rows are already saved, calling the endpoint for 4 products in parallel per batch
for i in tqdm(range(len(pd.read_csv("data/df_Salling_Products_Neighbours.csv", sep=";")), len(df2), time_step)):
    neighbor_list = []
    p_id_list = []
    time.sleep(time_step * 0.2)  # Throttle to stay under the API rate limit
    p_id_list.append(list(df2[i:i+time_step]))
    results = Parallel(n_jobs=time_step)(delayed(get_neighbours)(product_id) for product_id in df2[i:i+time_step])
    for neighbours in results:
        neighbor_list.append(neighbours)
    df_neighbor["product_id"] = sum(p_id_list, [])
    df_neighbor["neighbor_products_id"] = neighbor_list
    df_neighbor.to_csv('data/df_Salling_Products_Neighbours.csv', sep=';', mode='a', index=False, header=False)
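The 429 handling above sleeps for a fixed 20 seconds and drops the batch. A common alternative is to retry the request itself with exponential backoff. The sketch below is not the project's code: `fetch_with_backoff` and the fake endpoint are hypothetical names used to keep the example self-contained.

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0):
    """Retry `fetch` while it reports HTTP 429, doubling the delay each time."""
    for attempt in range(max_retries):
        status, payload = fetch()
        if status != 429:
            return payload
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("Rate limited after all retries")

# Fake endpoint: rate-limited on the first call, succeeds on the second
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return (429, None) if calls["n"] == 1 else (200, [{"prod_id": "18381"}])

print(fetch_with_backoff(fake_fetch, base_delay=0.01))  # → [{'prod_id': '18381'}]
```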
2.2 Preprocessing and Cleaning ¶
In this section the data is filtered and cleaned to prepare it for the analysis. The preprocessing is handled in three steps:
- Product data (ids, names, prices, categories)
- Textual data (product descriptions)
- Neighbors data (frequently bought together)
The pandas library is used to read the csv files for the webscraped data:
df_products = pd.read_csv('data/df_Salling_Products.csv' , delimiter=";") # Read the csv file containing the products_id
df_text = pd.read_csv('data/df_Salling_Products_Descriptions_CLEANED.csv' , delimiter=";") # Read the csv file containing the descriptions
df_neighbours = pd.read_csv('data/df_Salling_Products_Neighbours.csv', delimiter=";") # Read the csv file containing the neighbours to a specific product_id
# Cleaning product data by stripping unnecessary characters etc.
df_products[['price_amount', 'unit']] = df_products['price'].str.split('/', expand=True) # Split the price attribute into amount and unit
df_products['price_amount'] = df_products['price_amount'].str.replace(',', '.').str.extract(r'(\d+\.\d+)').astype(float) # Convert the amount to a float
df_products['category'] = df_products['category'].str.replace('/','') # Strip the trailing slash from the category attribute
df_products.drop(columns=['Unnamed: 0'], inplace=True) # Drop the leftover index column
#Distribute the 21 different inner categories into 3 main categories: Foods, House and Other
foods = ['frugt-og-groent', 'koed-og-fisk', 'mejeri-og-koel', 'drikkevarer', 'broed-og-kager','kolonial', 'slik-og-snacks', 'frost']
house = ['husholdning', 'personlig-pleje', 'baby-og-boern', 'bolig-og-koekken', 'fritid-og-sport', 'toej-og-sko', 'have', 'leg', 'byggemarked']
other = ['dyremad','elektronik','biludstyr', 'kiosk']
df_products['outer_category'] = df_products['category'].map(lambda x: 'Foods' if x in foods else ('House' if x in house else 'Other'))
df_products.head()
| | product_id | name | price | link | category | price_amount | unit | outer_category |
|---|---|---|---|---|---|---|---|---|
| 0 | 18381 | Bananer | 2,75/Stk. | /produkt/bananer/18381/ | frugt-og-groent | 2.75 | Stk. | Foods |
| 1 | 51061 | Peberfrugter røde | 7,25/Stk. | /produkt/peberfrugter-roede/51061/ | frugt-og-groent | 7.25 | Stk. | Foods |
| 2 | 61090 | Agurk øko | 9,00/Stk. | /produkt/salling-oeko-agurk-oeko/61090/ | frugt-og-groent | 9.00 | Stk. | Foods |
| 3 | 72008 | Bananer 4 pak øko | 2,20/Stk. | /produkt/bananer-4-pak-oeko/72008/ | frugt-og-groent | 2.20 | Stk. | Foods |
| 4 | 18323 | Gulerødder | 10,00/Kg. | /produkt/salling-guleroedder/18323/ | frugt-og-groent | 10.00 | Kg. | Foods |
Distribution of the categories of the webscraped products.
alt.data_transformers.disable_max_rows()
domain=['Foods', 'House', 'Other']
crange=['#2ca02c','#1f77b4', '#ff7f0e', ]
all_chart = alt.Chart(df_products).mark_bar().encode(
x='count():Q',
y=alt.Y('category:N', sort='-x', title='Inner Category'),
color=alt.Color('outer_category:N',scale=alt.Scale(domain=domain, range=crange),title='Outer Category'),
tooltip=['category', 'outer_category', 'count()']
).properties(
title='Distribution of the products in the different categories',
width=300,
height=300
)#.configure(background='transparent')
df_filtered = df_products[df_products['outer_category'] == 'Foods']
foods_chart = alt.Chart(df_filtered).mark_bar().encode(
x='count():Q',
y=alt.Y('category:N', sort='-x', title='Inner Categories'),
color=alt.Color('outer_category:N',scale=alt.Scale(domain=domain, range=crange),title='Outer Category'),
tooltip=['category', 'outer_category', 'count()']
).properties(
title='Distribution of the products in the Foods category only',
width=300,
height=300
)#.configure(background='transparent')
concated = (all_chart | foods_chart).interactive()#.configure(background='transparent')
concated
#concated.save('images/concated_chart_categories.png', scale_factor=2.0)